This paper studies the problem of designing compact binary architectures for vision multi-layer perceptrons (MLPs). We provide extensive analysis on the difficulty of binarizing vision MLPs and find that previous binarization methods perform poorly due to limited capacity of binary MLPs. In contrast with the traditional CNNs that utilizing convolutional operations with large kernel size, fully-connected (FC) layers in MLPs can be treated as convolutional layers with kernel size $1\times1$. Thus, the representation ability of the FC layers will be limited when being binarized, and places restrictions on the capability of spatial mixing and channel mixing on the intermediate features. To this end, we propose to improve the performance of binary MLP (BiMLP) model by enriching the representation ability of binary FC layers. We design a novel binary block that contains multiple branches to merge a series of outputs from the same stage, and also a universal shortcut connection that encourages the information flow from the previous stage. The downsampling layers are also carefully designed to reduce the computational complexity while maintaining the classification performance. Experimental results on benchmark dataset ImageNet-1k demonstrate the effectiveness of the proposed BiMLP models, which achieve state-of-the-art accuracy compared to prior binary CNNs. The MindSpore code is available at \url{https://gitee.com/mindspore/models/tree/master/research/cv/BiMLP}.
translated by 谷歌翻译
Adder Neural Network (AdderNet) provides a new way for developing energy-efficient neural networks by replacing the expensive multiplications in convolution with cheaper additions (i.e.l1-norm). To achieve higher hardware efficiency, it is necessary to further study the low-bit quantization of AdderNet. Due to the limitation that the commutative law in multiplication does not hold in l1-norm, the well-established quantization methods on convolutional networks cannot be applied on AdderNets. Thus, the existing AdderNet quantization techniques propose to use only one shared scale to quantize both the weights and activations simultaneously. Admittedly, such an approach can keep the commutative law in the l1-norm quantization process, while the accuracy drop after low-bit quantization cannot be ignored. To this end, we first thoroughly analyze the difference on distributions of weights and activations in AdderNet and then propose a new quantization algorithm by redistributing the weights and the activations. Specifically, the pre-trained full-precision weights in different kernels are clustered into different groups, then the intra-group sharing and inter-group independent scales can be adopted. To further compensate the accuracy drop caused by the distribution difference, we then develop a lossless range clamp scheme for weights and a simple yet effective outliers clamp strategy for activations. Thus, the functionality of full-precision weights and the representation ability of full-precision activations can be fully preserved. The effectiveness of the proposed quantization method for AdderNet is well verified on several benchmarks, e.g., our 4-bit post-training quantized adder ResNet-18 achieves an 66.5% top-1 accuracy on the ImageNet with comparable energy efficiency, which is about 8.5% higher than that of the previous AdderNet quantization methods.
translated by 谷歌翻译
The combination of transformers and masked image modeling (MIM) pre-training framework has shown great potential in various vision tasks. However, the pre-training computational budget is too heavy and withholds the MIM from becoming a practical training paradigm. This paper presents FastMIM, a simple and generic framework for expediting masked image modeling with the following two steps: (i) pre-training vision backbones with low-resolution input images; and (ii) reconstructing Histograms of Oriented Gradients (HOG) feature instead of original RGB values of the input images. In addition, we propose FastMIM-P to progressively enlarge the input resolution during pre-training stage to further enhance the transfer results of models with high capacity. We point out that: (i) a wide range of input resolutions in pre-training phase can lead to similar performances in fine-tuning phase and downstream tasks such as detection and segmentation; (ii) the shallow layers of encoder are more important during pre-training and discarding last several layers can speed up the training stage with no harm to fine-tuning performance; (iii) the decoder should match the size of selected network; and (iv) HOG is more stable than RGB values when resolution transfers;. Equipped with FastMIM, all kinds of vision backbones can be pre-trained in an efficient way. For example, we can achieve 83.8%/84.1% top-1 accuracy on ImageNet-1K with ViT-B/Swin-B as backbones. Compared to previous relevant approaches, we can achieve comparable or better top-1 accuracy while accelerate the training procedure by $\sim$5$\times$. Code can be found in https://github.com/ggjy/FastMIM.pytorch.
translated by 谷歌翻译
Models should be able to adapt to unseen data during test-time to avoid performance drops caused by inevitable distribution shifts in real-world deployment scenarios. In this work, we tackle the practical yet challenging test-time adaptation (TTA) problem, where a model adapts to the target domain without accessing the source data. We propose a simple recipe called \textit{Data-efficient Prompt Tuning} (DePT) with two key ingredients. First, DePT plugs visual prompts into the vision Transformer and only tunes these source-initialized prompts during adaptation. We find such parameter-efficient finetuning can efficiently adapt the model representation to the target domain without overfitting to the noise in the learning objective. Second, DePT bootstraps the source representation to the target domain by memory bank-based online pseudo-labeling. A hierarchical self-supervised regularization specially designed for prompts is jointly optimized to alleviate error accumulation during self-training. With much fewer tunable parameters, DePT demonstrates not only state-of-the-art performance on major adaptation benchmarks VisDA-C, ImageNet-C, and DomainNet-126, but also superior data efficiency, i.e., adaptation with only 1\% or 10\% data without much performance degradation compared to 100\% data. In addition, DePT is also versatile to be extended to online or multi-source TTA settings.
translated by 谷歌翻译
本文研究了重量和激活都将二进制神经网络(BNN)二进制为1位值,从而大大降低了记忆使用率和计算复杂性。由于现代深层神经网络具有复杂的设计,具有复杂的架构,其准确性,因此权重和激活分布的多样性非常高。因此,传统的符号函数不能很好地用于有效地在BNN中进行全精度值。为此,我们提出了一种称为Adabin的简单而有效的方法,可自适应获得最佳的二进制集$ \ {b_1,b_2 \} $($ b_1,b_1,b_2 \ in \ mathbb {r} $)的重量和激活而不是固定集(即$ \ { - 1,+1 \} $)。通过这种方式,提出的方法可以更好地拟合不同的分布,并提高二进制特征的表示能力。实际上,我们使用中心位置和1位值的距离来定义新的二进制量化函数。对于权重,我们提出了一种均衡方法,将对称分布的对称中心与实价分布相对,并最大程度地减少它们的kullback-leibler差异。同时,我们引入了一种基于梯度的优化方法,以获取这两个激活参数,这些参数以端到端的方式共同训练。基准模型和数据集的实验结果表明,拟议的Adabin能够实现最新性能。例如,我们使用RESNET-18体系结构在Imagenet上获得66.4 \%TOP-1的精度,并使用SSD300获得了Pascal VOC的69.4映射。
translated by 谷歌翻译
网络架构在基于深度学习的计算机视觉系统中起关键作用。广泛使用的卷积神经网络和变压器将图像视为网格或序列结构,该网格或序列结构并非灵活以捕获不规则和复杂的对象。在本文中,我们建议将图像表示为图形结构,并引入新的视觉GNN(VIG)体系结构,以提取视觉任务的图形级特征。我们首先将图像拆分为许多被视为节点的补丁,然后通过连接最近的邻居来构造图形。根据图像的图表表示,我们构建了VIG模型以在所有节点之间转换和交换信息。 VIG由两个基本模块组成:用于汇总和更新图形信息的图形卷积的图形模块,以及带有两个线性层的FFN模块用于节点特征转换。 VIG的各向同性和金字塔体系结构均具有不同的型号。关于图像识别和对象检测任务的广泛实验证明了我们的VIG架构的优势。我们希望GNN关于一般视觉任务的开创性研究将为未来的研究提供有用的灵感和经验。 pytorch代码可在https://github.com/huawei-noah/effficity-ai-backbones上获得,Mindspore代码可在https://gitee.com/mindspore/models上获得。
translated by 谷歌翻译
已经出现了许多变形金刚的改编,以解决单模式视觉任务,在该任务中,自我发项模块被堆叠以处理图像之类的输入源。直观地,将多种数据馈送到视觉变压器可以提高性能,但是内模式的专注权也可能会稀释,从而可能破坏最终性能。在本文中,我们提出了一种针对基于变压器的视力任务的多模式令牌融合方法(TokenFusion)。为了有效地融合多种方式,TokenFusion动态检测非信息令牌,并用投影和聚合的模式间特征将这些令牌替换为这些令牌。还采用了残留位置对准来实现融合后模式间比对的明确利用。 TokenFusion的设计使变压器能够学习多模式特征之间的相关性,而单模式变压器体系结构基本上保持完整。对各种均质和异构方式进行了广泛的实验,并证明TokenFusion在三个典型的视觉任务中超过了最新方法:多模式图像到图像到图像到图像转换,RGB深度语义分段和3D对象检测3D对象检测点云和图像。我们的代码可从https://github.com/yikaiw/tokenfusion获得。
translated by 谷歌翻译
自上而下的实例分割框架与自下而上的框架相比,它在对象检测方面表现出了优越性。虽然它有效地解决了过度细分,但自上而下的实例分割却遭受了过度处理问题。然而,完整的分割掩模对于生物图像分析至关重要,因为它具有重要的形态特性,例如形状和体积。在本文中,我们提出了一个区域建议纠正(RPR)模块,以解决这个具有挑战性的分割问题。特别是,我们提供了一个渐进式皇家模块,以逐渐将邻居信息引入一系列ROI。 ROI功能被馈入专门的进料网络(FFN)以进行提案框回归。有了其他邻居信息,提出的RPR模块显示了区域建议位置的校正显着改善,因此与最先进的基线方法相比,在三个生物图像数据集上表现出有利的实例分割性能。实验结果表明,所提出的RPR模块在基于锚固的和无锚的自上而下实例分割方法中有效,这表明该方法可以应用于生物学图像的一般自上而下实例分割。代码可用。
translated by 谷歌翻译
由于存储器和计算资源有限,部署在移动设备上的卷积神经网络(CNNS)是困难的。我们的目标是通过利用特征图中的冗余来设计包括CPU和GPU的异构设备的高效神经网络,这很少在神经结构设计中进行了研究。对于类似CPU的设备,我们提出了一种新颖的CPU高效的Ghost(C-Ghost)模块,以生成从廉价操作的更多特征映射。基于一组内在的特征映射,我们使用廉价的成本应用一系列线性变换,以生成许多幽灵特征图,可以完全揭示内在特征的信息。所提出的C-Ghost模块可以作为即插即用组件,以升级现有的卷积神经网络。 C-Ghost瓶颈旨在堆叠C-Ghost模块,然后可以轻松建立轻量级的C-Ghostnet。我们进一步考虑GPU设备的有效网络。在建筑阶段的情况下,不涉及太多的GPU效率(例如,深度明智的卷积),我们建议利用阶段明智的特征冗余来制定GPU高效的幽灵(G-GHOST)阶段结构。舞台中的特征被分成两个部分,其中使用具有较少输出通道的原始块处理第一部分,用于生成内在特征,另一个通过利用阶段明智的冗余来生成廉价的操作。在基准测试上进行的实验证明了所提出的C-Ghost模块和G-Ghost阶段的有效性。 C-Ghostnet和G-Ghostnet分别可以分别实现CPU和GPU的准确性和延迟的最佳权衡。代码可在https://github.com/huawei-noah/cv-backbones获得。
translated by 谷歌翻译
变压器网络对计算机视觉任务取得了很大的进步。变压器 - 变压器(TNT)架构利用内部变压器和外部变压器提取本地和全局表示。在这项工作中,我们通过引入两个先进的设计:1)金字塔架构和2)卷积阀。通过建立分层表示,新的“金字塔”显着改善了原始TNT。Pyramidtnt比以前的最先进的视觉变压器(如Swin Transformer)实现更好的表演。我们希望这一新基线能够有助于视觉变压器的进一步研究和应用。代码将在https://github.com/huawei-noah/cv-backbones/tree/master/tnt_pytorch获得。
translated by 谷歌翻译